if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!requireNamespace("GEOmetadb", quietly = TRUE)){
BiocManager::install("GEOmetadb")
}
if (!require("knitr", quietly = TRUE))
install.packages("knitr")
if (! requireNamespace("edgeR", quietly = TRUE)) {
BiocManager::install("edgeR")
}
if (!requireNamespace("ComplexHeatmap", quietly = TRUE)){
BiocManager::install("ComplexHeatmap")
}
if (!require("kableExtra", quietly = TRUE))
install.packages("kableExtra")
if (! requireNamespace("RCurl", quietly = TRUE)) {
BiocManager::install("RCurl")
}
if (!require("GSA", quietly = TRUE))
install.packages("GSA")
if (!require("VennDiagram")) {
install.packages("VennDiagram")
}
library(GEOmetadb)
library(knitr)
library(edgeR)
library(ComplexHeatmap)
library(circlize)
library(kableExtra)
library(RCurl)
library(GSA)
library(VennDiagram)
The choice of dataset is GSE85995, the data file was download by using GEOmetadb package method.
In assignment 1, clean process was performed to remove duplicated, non-informative and weakly express data. Mapping process was performed by mappping row identifiers into the most updated HGNC symbols. The cleaned and mapped data was saved into gata3_cleaned_mapped_data.rds file. More detail code algorithm was presented in A1.Rmd file.
sfiles = getGEOSuppFiles('GSE85995')
# <clean and map, the detailed code are in "A1.Rmd"> ...
if (!file.exists("gata3_cleaned_mapped_data.rds")) {
options(knitr.duplicate.label = 'allow')
source(purl("A1.Rmd", output = tempfile()))
}
gata3_cleaned_maped_data <- readRDS("gata3_cleaned_mapped_data.rds")
filtered_data_matrix <- as.matrix(gata3_cleaned_maped_data[,3:ncol(gata3_cleaned_maped_data)])
rownames(filtered_data_matrix) <- gata3_cleaned_maped_data$hgnc_symbols
group = c(rep("scrambled siRNA",3),rep("GATA3 siRNA",3))
d = DGEList(counts=filtered_data_matrix, group = group)
d = calcNormFactors(d)
normalized_cleaned_gata3 <- cpm(d)
# before
data2plot <- log2(cpm(gata3_cleaned_maped_data[,3:ncol(gata3_cleaned_maped_data)]))
counts_density <- apply(data2plot, 2, density)
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
xlim <- range(c(xlim, counts_density[[i]]$x));
ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM",
main = "Figure 1.1: Pre-normalization Density Plot",
cex.lab = 0.85)
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i], lty=ltys[i])
legend("topright", colnames(data2plot),
col=cols, lty=ltys, cex=0.75,
border ="blue", text.col = "green4",
merge = TRUE, bg = "gray90")
# after
normalized_data2plot <- log2(normalized_cleaned_gata3)
normalized_counts_density <- apply(normalized_data2plot, 2, density)
xlim <- 0; ylim <- 0
for (i in 1:length(normalized_counts_density)) {
xlim <- range(c(xlim, normalized_counts_density[[i]]$x));
ylim <- range(c(ylim, normalized_counts_density[[i]]$y))
}
cols <- rainbow(length(normalized_counts_density))
ltys <- rep(1, length(normalized_counts_density))
plot(normalized_counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM",
main = "Figure 1.2: Post-normalization Density Plot",
cex.lab = 0.85)
for (i in 1:length(normalized_counts_density)) lines(normalized_counts_density[[i]], col=cols[i], lty=ltys[i])
legend("topright", colnames(normalized_data2plot),
col=cols, lty=ltys, cex=0.75,
border ="blue", text.col = "green4",
merge = TRUE, bg = "gray90")
Figure 1: GATA3 RNASeq Samples Density Plots Comparison by using TMM Normalization.
if (!file.exists("gata3_cleaned_normalized.rds")) {
saveRDS(normalized_cleaned_gata3, "gata3_cleaned_normalized.rds")
}
Figure 2: Initial heat map for the normalized GATA3 data
Figure 3: GATA3 RNASeq Samples MDS Plots Comparison by using Two Models
Two models were designed: simple (cell type only) and complex (cell type and patient variation). The lmFit() method in the limma package was applied to perform the Benjamini-Hochberg FDR multiple hypothesis testing at the threshold of 0.05. The threshold value is also the same value used in the original paper.
According to the results in Figure 4, the gene of interest was selected in both models with close to 0 p-values. The majority of genes stood not significant. For the differential expressed gene (DEG) from both models, the p-values are fundamentally below the threshold of 0.05.
The complex model was used for the edgeR QLF test additionally. Similar to the result from limma, the edge results also have a large portion of non-significant genes as the grey region in Figure 5. But the amount of differentially expressed genes selected by the edgeR method was less than that of limma. GATA3 was also selected by the edgeR method with an expected low p-value.
To quantitatively summarize the results from both methods:
In the limma method, there were 6378 DEGs with the threshold of 0.05, which is 51.17% of genes in the cleaned and mapped data.
In the edgeR method, there were 1419 DEGs with the threshold of 0.05, which is 11.38% of genes in the cleaned and mapped data.
After the p-value correction by the Benjamini-Hochberg method, 5311 genes passed in the limma method, which has the 42.61% of the selected differentially expressed genes. From the chosen model in the edgeR method, 452 genes passed the correction. The percentage was 3.63%.
The volcano plot in Figure 6 displays the division of up-regulated and down-regulated genes from the edgeR result. A clear separation for up-regulated and down-regulated genes occurred based on the logFC.
The top hits from the edgeR results were used to generate the heat map in Figure 7. The graph presents clustering based on cell type. The clustered patterning is more clear compared to the initial heatmap.
Figure 7: Normalized GATA3 RNASeq Samples Heat Maps by using Complex Model in edgeR
qlf_output_hits_withgn <- merge(rownames(gata3_data),qlf_output_hits, by.x=1, by.y = 0)
colnames(qlf_output_hits_withgn)[1] <- "hgnc_symbol"
qlf_output_hits_withgn[,"rank"] <- -log(qlf_output_hits_withgn$PValue,base =10) *
sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]
up_regulated <- qlf_output_hits_withgn$hgnc_symbol[which(qlf_output_hits_withgn$PValue < 0.05
& qlf_output_hits_withgn$logFC > 0)]
down_regulated <- qlf_output_hits_withgn$hgnc_symbol[which(qlf_output_hits_withgn$PValue < 0.05
& qlf_output_hits_withgn$logFC < 0)]
if(!file.exists("gata3_upregulated_genes.txt")){
write.table(x=up_regulated,
file=file.path("data","gata3_upregulated_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
}
if(!file.exists("gata3_downregulated_genes.txt")){
write.table(x=down_regulated,
file=file.path("data","gata3_downregulated_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
}
if(!file.exists("gata3_ranked_genelist.txt")){
write.table(x=data.frame(genename= qlf_output_hits_withgn$hgnc_symbol,
F_stat=qlf_output_hits_withgn$rank),
file=file.path("data","gata3_ranked_genelist.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
}
| Resource | Top Term | \(T\) | \(Q\) | \(T \cap Q\) |
|---|---|---|---|---|
| Go: BP | regulation of chromosome organization | 192 | 834 | 30 |
| REAC | Mitotic Prometaphase | 199 | 566 | 35 |
| WP | Retinoblastoma gene in cancer | 88 | 466 | 32 |
| Resource | Top Term | \(T\) | \(Q\) | \(T \cap Q\) |
|---|---|---|---|---|
| Go: BP | ribosomal large subunit biogenesis | 72 | 527 | 18 |
| REAC | rRNA processing in the nucleus and cytosol | 192 | 399 | 40 |
| WP | Metabolic reprogramming in colon cancer | 44 | 341 | 16 |
| Resource | Top Term | \(T\) | \(Q\) | \(T \cap Q\) |
|---|---|---|---|---|
| Go: BP | regulation of chromosome organization | 192 | 1361 | 41 |
| REAC | rRNA processing in the nucleus and cytosol | 192 | 965 | 47 |
| WP | Retinoblastoma gene in cancer | 88 | 787 | 36 |
Full list of the query results:
The ranked list saved in the assignment 2 was used as rank list input in non-threshold gene sets enrichment analysis.
Firstly, I converted the ranked txt file into an rnk file as valid GSEA input.
txt_file <- read.delim("data/gata3_ranked_genelist.txt", header = FALSE)
write.table(x=txt_file, file=file.path("data","gata3_ranked_genelist.rnk"),sep = "\t",
row.names = FALSE,col.names = c('GeneName','rank'),quote = FALSE)
# go to the current_release for finding the latest genesets for human
gmt_url = "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"
filenames = getURL(gmt_url)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea.*.)(.gmt)(?=\">)",
contents, perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))
dest <- file.path(paste(getwd(),'data', sep="/"),gmt_file)
download.file(paste(gmt_url,gmt_file,sep=""), destfile=dest)
The parameters used are:
The GSEA was performed externally by using the above files and configurations in the GSEA software.
The initial html of the GSEA results: index.html
I used GSEA Preranked method in GSEA 4.2.3 software.
The geneset database used was the latest version of the human symbol from the Bader lab gene sets collections (Human_GOBP_AllPathways_no_GO_iea_March_01_2022_symbol.gmt) (“Enrichment Map Gene Sets” 2016). The query used the getURL() method in the RCurl package to obtain the file (Temple Lang 2022). Applied the regex expression to find the latest updated file including GO biological pathway dataset and download it into the data folder for use.
The parameter for setting GSEA Preranked task was listed on above.
The reason for regulating the geneset size range from 200 to 15 is to obtain more specific meaningful pathways and avoid extreme values affecting the results. The choice of not collapse is due to the uniqueness of all gene symbols in the ranked list file. In the results generated by assignment 1, all hgnc symbols saved in the normalized data were verified to be unique. This setting is also the recommended parameter setting for a ranked gene list with unique human gene symbols in the user guide of GSEA (Subramanian et al. 2005).
The top genesets returned summarized table is:
|
Type |
Top geneset |
Size |
P-value |
ES |
NES |
FDR |
Top gene |
|---|---|---|---|---|---|---|---|
|
na_pos |
CYTOSKELETON-DEPENDENT CYTOKINESIS%GOBP%GO:0061640 |
88 |
0 |
0.69 |
2.03 |
0.014 |
STMN1 |
|
na_neg |
VALIDATED TARGETS OF C-MYC TRANSCRIPTIONAL ACTIVATION%PATHWAY INTERACTION DATABASE NCI-NATURE CURATED DATA%VALIDATED TARGETS OF C-MYC TRANSCRIPTIONAL ACTIVATION |
75 |
0 |
-0.79 |
-2.24 |
0.000 |
UBTF |
Compared with the assignment 2 results, the top genesets returned in the enrichment analysis were not identical. For the up-regulated genes, the second-ranked geneset (MITOTIC PROMETAPHASE%REACTOME DATABASE ID RELEASE 79%68877) was the top geneset in the assignment 2 Reactome result. The top geneset generated from the GSEA result was the 12th of GO:BP results in assignment 2. Similar to the down-regulated genes, the third GSEA ranked geneset (rRNA PROCESSING IN THE NUCLEUS AND CYTOSOL%REACTOME%R-HSA-8868773.3) was the top hit in the assignment 2 Reactome result.
This comparison is straightforward. The ranked gene content imported in the enrichment analysis is the same as the normalized data but additionally assigned with ranking scores. The annotation source of GO: BP is also commonly applied in both analyses. Also in the parameter settings, I used the same range of 15 to 200 to limit the genome size in both threshold and non-threshold analyses. However, the top hit genesets are relevant but not identical.
The enrichment map was created externally using Cytoscape software (Shannon et al. 2003) and EnrichmentMap Cytoscape App (Merico et al. 2010). The network visualization pictures were saved into the data folder.
Cytoscape 3.9.1 and EnrichmentMap Cytoscape App 3.3.3 were used to perform the GESA results visualization.
The parameter for generating the enrichment map is:
186 gene sets (nodes) and 496 overlaps between gene sets (edges) present on the network.
The picture of current network: network.
The annotation setup was created externally using AutoAnnotate App (Kucera et al. 2016) in Cytoscape software.
The annotation used was default from AutoAnnotate app 1.3.5 in Cytoscape. The choice of cluster annotation model is MCL Cluster Annotation Set by default. The label column is by GS_DESCR (gene description). Create singleton clusters and layout network to prevent cluster overlap are selected.
The screenshot of annotated network: annotated_network
Figure 8: Publication ready figure for enrichment results
The two largest clusters of the down-regulated genesets on the network are process purine biosynthetic and major rRNA nucleus. They contain 23 nodes and 14 nodes respectively. On the other side, protein centrosome mitotic was the largest up-regulated genesets cluster in Figure 8. It contains 10 nodes. These clusters would be considered a theme network.
Comparing to the g:profiler results, the major rRNA nucleus cluster contain the down-regulated geneset top hits from Reactome (rRNA processing in the nucleus and cytosol REAC:R-HSA-8868773) and GO:BP (ribosomal large subunit biogenesis GO:0042273). Also in the up-regulated genesets, the Reactome top hit (Mitotic prometaphase REAC:R-HSA-68877) was included in the protein centrosome mitotic cluster.
A typical novel pathway would be tubule development morphogenesis, which was not present in the g:profiler results.
The original paper performed DEG analysis and cellular function analysis network for GATA3 knockdown trophoblasts. The conclusion is the regulation of GATA3 for trophoblast cells in migration and invasion aspects (Lee et al. 2016). The non-threshold methods results contain the tubule development morphogenesis cluster. The largest geneset in the cluster, tissue morphogenesis, includes GATA3 and MYC at the top leading genes, which conform to the conclusion of MYC being one of the down-regulated genes mentioned in the paper. Most of the pathways in this cluster also contain MYC as leading genes.
The top hits obtained from the two methods were different genesets, but the several top hits from the GSEA results contain the top hits from the threshold method results. New pathways have also emerged in network visualization, such as tubule development morphogenesis.
The major rRNA nucleus cluster is one of the major themes in the network. The highest-ranked gene in this cluster is eIF4A. The evidence I found is the correlation of eIF4A for trophoblast cells activities. It was reported to involve the restricted global protein synthesis in trophoblast cells by interacting with AMOT (Enrichment of angiomotin). Over AMOT expression in the placenta was associated with intrauterine growth restriction in both rats and humans (Basak et al. 2020).
The evidence could provide the involvement of major rRNA nucleus clusters in the effect of trophoblast cells activities.
# use the same gene database
gmt_file <- file.path("data", "Human_GOBP_AllPathways_no_GO_iea_March_01_2022_symbol.gmt")
capture.output(genesets <- GSA.read.gmt(gmt_file), file = "gsa_loud.out")
names(genesets$genesets) <- genesets$geneset.names
# load expression file and rank scores
expression <- readRDS("gata3_cleaned_normalized.rds")
ranks <- ranks <- read.table(file.path(getwd(), "data", "gata3_ranked_genelist.txt"),
header = FALSE, sep = "\t", quote = "\"",
stringsAsFactors = FALSE)
# get gsea result files
gseaDirectories <- list.files(path = file.path(getwd(),"data/gsea_results"),
pattern = "\\.GseaPreranked")
if(length(gseaDirectories) == 1){
gseaDir <- file.path(getwd(),"data/gsea_results")
gseaResultsFiles <- list.files(path = gseaDir,
pattern = "gsea_report_*.*.tsv")
enrFile1 <- read.table(file.path(gseaDir,gseaResultsFiles[1]),
header = TRUE, sep = "\t", quote="\"",
stringsAsFactors = FALSE,row.names=1)
enrFile2 <- read.table(file.path(gseaDir,gseaResultsFiles[1]),
header = TRUE, sep = "\t", quote="\"",
stringsAsFactors = FALSE,row.names=1)
}
FDR_threshold <- 0.001
all_sig_enr_genesets<- c(rownames(enrFile1)[which(enrFile1[,"FDR.q.val"]<=FDR_threshold)], rownames(enrFile2)[which(enrFile2[,"FDR.q.val"]<=FDR_threshold)])
genes_sig_enr_gs <- c()
for(i in 1:length(all_sig_enr_genesets)){
current_geneset <- unlist(genesets$genesets[which(genesets$geneset.names %in% all_sig_enr_genesets[i])])
genes_sig_enr_gs <- union(genes_sig_enr_gs, current_geneset)
}
genes_all_gs <- unique(unlist(genesets$genesets))
A <- genes_all_gs # all genes in the gmt file
B <- genes_sig_enr_gs # all genes in the enrichment result
C <- rownames(expression) # all genes in the expression
venn_diagram_path <- file.path(getwd(),"data","dark_matter_overlaps.png")
png(venn_diagram_path)
draw.triple.venn( area1=length(A), area2=length(B), area3 = length(C),
n12 = length(intersect(A,B)), n13=length(intersect(A,C)),
n23 = length(intersect(B,C)),
n123 = length(intersect(A,intersect(B,C))),
category = c("all genesets","all enrichment results","expression"),
fill = c("red","green","blue"),
cat.col = c("red","green","blue"))
dev.off()
Figure 9: Dark Matter Overlap Venn Diagram. The
# genes without annotation
genes_no_annotation <- setdiff(C, A)
#purple area: 1553 genes
length(genes_no_annotation)
## [1] 1553
ranked_gene_no_annotation <- ranks[which(ranks[,1] %in% genes_no_annotation),]
kable(ranked_gene_no_annotation[1:10,], format = 'html', caption = "Table 3.1: The Top 10 ranked genes with no annotation", col.names = c("Gene","Rank score"), escape=FALSE,align=c(rep('c',times=2)))%>%
kable_styling(full_width = T)
| Gene | Rank score | |
|---|---|---|
| 45 | RCN2 | -8.051095 |
| 130 | MROH6 | -4.092775 |
| 177 | CCDC86 | -3.315124 |
| 302 | DANCR | -2.103717 |
| 322 | NT5DC2 | -2.020946 |
| 326 | YDJC | -2.008087 |
| 331 | MIR137HG | -1.989380 |
| 353 | CTXN1 | -1.886112 |
| 365 | NME1-NME2 | -1.839917 |
| 423 | HPCAL1 | -1.601521 |
12.4598845% of genes were not found annotation, which is not a large portion and expected. Due to the normalization and mapping, not too much genes were not included in the latest version of gmt database.
The top gene RCN2 is shown to be associated with Reticulocalbin-2 protein which has GO annotation from GO_Central Uniprot query for RCN2. If we zoom in on the regulatory role of RCN2, it has been shown to regulate blood pressure and contribute to hypertension by affecting the endothelial NO synthases (chang_yan_yan_zheng_liu_2019?). It might not have strong correlations with trophoblast cells than the genesets included in the results.
The second gene MROH6 (Maestro Heat Like Repeat Family Member 6) has no annotation (Uniprot query for MROH6), which conforms to the feature of this group (no annotation). It is associated with autosomal recessive non-syndromic intellectual disability and annotated with binding in GO (mroh6_database?).
Figure 10: Heat map of genes with no annotation
# genes without annotation and in enrichment analysis
genes_annotated_enr_non_sig <- setdiff(C, B)
# the purple and fuchsia areas: 1553 + 10395 genes
length(genes_annotated_enr_non_sig)
## [1] 11948
ranked_genes_annotated_enr_non_sig <- ranks[which(ranks[,1] %in% genes_annotated_enr_non_sig),]
kable(ranked_genes_annotated_enr_non_sig[1:10,], format = 'html', caption = "Table 3.2: The Top 10 ranked genes with annotation and not significant in enrichment analysis", col.names = c("Gene","Rank score"), escape=FALSE,align=c(rep('c',times=2)))%>%
kable_styling(full_width = T)
| Gene | Rank score | |
|---|---|---|
| 1 | TGM2 | -27.21846 |
| 2 | CCN1 | -27.00223 |
| 4 | TCEA1 | -22.05397 |
| 5 | KRT18 | -20.42065 |
| 7 | KRT8 | -17.27867 |
| 8 | KRTAP2-3 | -16.10960 |
| 9 | SPARC | -15.87424 |
| 10 | EIF5A | -15.68808 |
| 11 | CLDN11 | -15.66221 |
| 14 | ZFP36L1 | -14.43692 |
95.860077% of genes were evaluated as non-significant in enrichment results but present in expression, which is a large portion.
Since the use of no_GO_iea version of gmt file as the database, the large portion of genes in this category might be due to the limitation of the annotation source. We could perform another enrichment analysis with a more election source tolerated database.
The top gene TGM2 has a list of annotation sources including Uniprot and Ensembl (Uniprot query for TGM2). It is related to the Protein-glutamine gamma-glutamyltransferase 2 protein involved in multiple biological process including bone development, angiogenesis, wound healing, cellular differentiation, chromatin modification and apoptosis (Bioinformatics 2022). The reason for being excluded might due to the board correlation.
The second gene CCN1 has extracellular matrix binding function annotated in Ensembl (Uniprot query for CCN1). The missing of CCN1 might be due to the annotation source choice.
Figure 11: Heatmap for genes with annotation but not in enrichment results
In the significant genes that are not annotated to any of the pathways in the enrichment analysis, the pattern of heat map is not clear as the heat map in assignment 2 results. These genes are no
In the significant genes that are not annotated and returned in enrichment analysis, the pattern of heatmap is more clear and approach to the assignment 2 heatmap results.